Introduction

This report performs differential gene expression analysis and basic over-representation (thresholded gene set enrichment) analysis on expression data retrieved from GSE129943. The publication that produced this dataset can be accessed on the Cell website (Dvela-Levitt et al., 2019).This report builds on top of “Assignment 1: Dataset Selection and Initial Processing,” which can be downloaded and read here. The previous report cleaned the bulk RNASeq data from GSE129943, removing any low-count and duplicated genes. Additionally, the previous report performed TMM normalization on the read counts, visualized the normalized counts using MDS plots, mean-variance plots, and distribution curves. The final dataset produced by the report had 14,241 genes and six samples, three of which are replicates of normal human kidney cells and three of which are replicates of Muc1 Kidney Disease (MKD) cells. This report will be comparing the gene expression between normal kidney cells and MKD cells to identify enriched pathways, so as to better understand the mechanisms of MKD.

Acknowledgements

This R Notebook was developed by Evgeniya Gorobets as part of an assessment for 2022 BCB420H: Computational Systems Biology, University of Toronto, Toronto, CA. Specifically, this notebook was the final submitted product for Assignment 2 of the course. The journal entries accompanying this R Notebook are listed below.

Contributions

This report was created in R (R Core Team, 2021) with the knitr Xie (2017) and kableExtra packages (Zhu, 2021). edgeR (Chen et al., 2016; McCarthy et al., 2012; Robinson et al., 2010) was used to estimate dispersion and perform differential gene expression analysis. ggplot2 (Wickham, 2016) was used to create the volcano plots in this report. ComplexHeatmap (Gu et al., 2016) and circlize (Gu et al., 2014) were used to create the heatmaps in this report. The gprofiler2 package (Kolberg et al., 2020), which is built on top of the g:Profiler API (Raudvere et al., 2019), was used to perform over-representation analysis. ORA plots were generated with gprofiler2 (Kolberg et al., 2020) and plotly (Sievert, 2020). The report was created inside a Docker image (Merkel, 2014) whose base is Rocker (Boettiger & Eddelbuettel, 2017).

Set-Up

# Install required packages
if (!requireNamespace("kableExtra", quietly = TRUE)) {
  install.packages("kableExtra")
}
if (!requireNamespace("edgeR", quietly = TRUE)) {
  install.packages("edgeR")
}
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
if(!requireNamespace("gprofiler2", quietly = TRUE)) {
  install.packages("gprofiler2")
}
if (!requireNamespace("ComplexHeatmap", quietly = TRUE)) {
  install.packages("ComplexHeatmap")
}
# circlize::colorRamp2 required for continuous color scaling in ComplexHeatmap
if (!requireNamespace("circlize", quietly = TRUE)) {
  install.packages("circlize")
}
if (!requireNamespace("plotly", quietly = TRUE)) {
  install.packages("plotly")
}

dataDir <- '../data'

# Source A1 to get normalized counts
a1 <- knitr:: knit_child('../Assignment1/Assignment1.Rmd', quiet=T)
expressionSet <- normalizedCounts


Differential Gene Expression Analysis

Design Matrix

The MDS plot from Assignment 1 shows that all the MKD samples were highly clustered in both dimensions and all the control (normal kidney) samples were clustered in one dimension. Additionally, the publication that produced this data set (Dvela-Levitt et al., 2019) stated that there was no biological relation between the control and test samples. Therefore, the only factor in the design matrix is whether each sample is from a normal kidney or an MKD patient.

# Convert the normalized counts into a matrix
counts <- expressionSet[3:8]
rownames(counts) <- expressionSet$Ensembl

# Create a data.frame containing the type (normal/MKD) of each sample
sampleTypes <- unlist(lapply(colnames(counts), function(sampleName) {
  strsplit(sampleName, "_")[[1]][1]
})) 
samples <- data.frame(sample = colnames(counts), type = sampleTypes)

# Create a design matrix
designMatrix <- model.matrix(~ samples$type)
kableExtra::kable_styling(knitr::kable(
  designMatrix, type="html", 
  col.names = c("Intercept", "Samples.TypeP1A8"),
  caption=paste("The design matrix for the GSE129943 dataset.",
                "The Samples.TypeP1A8 column represents which of",
                "the samples is derived from an MKD patient.")),
  "condensed")
Table 1: The design matrix for the GSE129943 dataset. The Samples.TypeP1A8 column represents which of the samples is derived from an MKD patient.
Intercept Samples.TypeP1A8
1 0
1 0
1 0
1 1
1 1
1 1

The design matrix shown in Table 1 clearly demonstrates that three of the samples are of type P1A8 (i.e., MKD cells).

Differential Analysis

Now that the design matrix is defined, I can proceed with the differential gene expression analysis. I will use the negative binomial generalized linear model and the quasi-likelihood test, both implemented in the edgeR package (Chen et al., 2016; McCarthy et al., 2012; Robinson et al., 2010). I chose to use the edgeR package because it was designed specifically for bulk RNASeq data.

# Set up a DGEList using our counts matrix and design matrix
counts <- as.matrix(counts)
dge <- edgeR::DGEList(counts=counts, group=samples$type)
dge <- edgeR::estimateDisp(dge, designMatrix)

# Fit the model using glmGLFit
fit <- edgeR::glmQLFit(dge, designMatrix)

# Compute differential expression and extract results into data.frame
qlfResults <- edgeR::glmQLFTest(fit, coef='samples$typeP1A8')

The next step is to identify the genes that were significantly differentially expressed. We are going to explore the number of DEGs under four different thresholds (0.05, 0.01, 0.005, 0.001) so that we may choose the p-value that is most appropriate for our data.

pvals <- c(0.05, 0.01, 0.005, 0.001)
numSignificant <- unlist(lapply(pvals, function(pval) {
  length(which(qlfResults$table$PValue < pval))
}))
kableExtra::kable_styling(knitr::kable(
  data.frame(`P-Value`=pvals, `Num. Significant Genes`=numSignificant, 
             check.names=F),
  caption=paste("The number of significantly differentially",
                "expressed genes in the GSE129943 dataset, based",
                "on four different p-value thresholds."),
  type="html"), "condensed")
Table 2: The number of significantly differentially expressed genes in the GSE129943 dataset, based on four different p-value thresholds.
P-Value Num. Significant Genes
0.050 5597
0.010 4237
0.005 3851
0.001 3072

Table 2 shows that there are a large amount of significant DEGs, even when using a p-value as small as 0.001. This suggests a stronger threshold should be used to generate the thresholded gene lists for gene set enrichment analysis. However, before deciding a threshold, we must first adjust our p-values to correct for Multiple Hypothesis Testing.

Multiple Hypothesis Testing

Because we are testing for differential expression in thousands of genes (14,241 genes, to be precise), and each gene has its own null hypothesis that is being tested ( $H_0^{(X)} = $ gene \(X\) is not differentially expressed), we must consider the effects of Multiple Hypothesis Testing. Because we are testing so many hypotheses simultaneously, there is a much greater chance of false positives (Type I errors). In other words, there is a higher probability that a gene that is not differentially expressed will still have a p-value less than the significance threshold.

To correct for this, I will use the Benjamini-Hochberg procedure to compute a BH-adjusted p-value, also known as the False Discovery Rate (FDR). I am the Benjamini-Hochberg adjustment method because it is the most commonly used in recent literature and because it is less stringent than other correction methods (e.g., Bonferroni).

As above, we’re going to find the number of genes with significant differential expression under four different thresholds: 0.05, 0.01, 0.005, 0.001.

qlfAdjusted <- edgeR::topTags(qlfResults, n=nrow(counts), adjust.method="BH", 
                              sort.by="PValue")

numSigAdj <- unlist(lapply(pvals, function(pval) {
  length(which(qlfAdjusted$table$FDR < pval))
}))
kableExtra::kable_styling(knitr::kable(
  data.frame(`Adjusted P-Value (FDR)`=pvals, `Num. Significant Genes`=numSigAdj, 
              check.names=F), 
  caption=paste("The number of significantly differentially",
                "expressed genes in the GSE129943 dataset after",
                "Benjaminig-Hochberg correction, based on four",
                "different FDR thresholds."),
  type="html"), "condensed")
Table 3: The number of significantly differentially expressed genes in the GSE129943 dataset after Benjaminig-Hochberg correction, based on four different FDR thresholds.
Adjusted P-Value (FDR) Num. Significant Genes
0.050 4602
0.010 3457
0.005 3103
0.001 2486

Even with adjustment for Multiple Hypothesis Testing, more than 2000 genes had a significant change in expression, even with the strictest threshold of 0.001 (Table 3).

Let us consider two thresholds (0.05 and 0.001) and the DEG list that would be produced by each threshold. The DEGs produced by the stricter threshold (0.001) would have stronger signals, so the gene set enrichment analysis would reveal only the most strongly up-regulated or down-regulated pathways. This may help identify the most important biological processes in MKD cells, but may miss out on weakly-regulated pathways that still offer key insights to the mechanisms of MKD cells. On the other hand, gene set enrichment analysis on the DEG list produced by the weaker threshold (0.05) will likely reveal more pathways, but may fail to highlight the most important ones. I believe that it is important to conduct gene set enrichment analysis on both DEG lists, to get both a “deep” view and a “broad” view of the enriched pathways. Therefore, I will subset the qlfResults based on an adjusted p-value of 0.05 and 0.001. Since enough genes pass correction, moving forward I will only consider genes with a significant FDR (i.e., I will not subset or analyze based on the un-adjusted p-values).

# Subset QLF results based on both p-values
sigResults <- qlfAdjusted$table[qlfAdjusted$table$FDR < 0.05, ]
verySigResults <- qlfAdjusted$table[qlfAdjusted$table$FDR < 0.001, ]

# Get the up and down-regulated genes
upRegGenes <- rownames(sigResults[sigResults$logFC > 0, ])
veryUpRegGenes <- rownames(verySigResults[verySigResults$logFC > 0, ])

downRegGenes <- rownames(sigResults[sigResults$logFC < 0, ])
veryDownRegGenes <- rownames(verySigResults[verySigResults$logFC < 0, ])

# Save all the gene lists
geneLists <- list("upreg_genelist_01" = upRegGenes,
                  "downreg_genelist_01" = downRegGenes, 
                  "upreg_genelist_001" = veryUpRegGenes, 
                  "downreg_genelist_001" = veryDownRegGenes)

for (geneList in names(geneLists)) {
  write.table(geneLists[[geneList]], 
              file=paste0(dataDir, "/", geneList, ".txt"), 
              quote=F, sep="\n", row.names=F, col.names=F)
}


Visualizing DEGs

Volcano Plots

First, we will use a volcano plot to visualize our DEGs relative to our full gene list.

# Copy relevant data from QLF Test results
fdrs <- qlfAdjusted$table[, c("logFC", "FDR")]

# Add a column that indicates whether each gene is not a DEG, a DEG but only 
# with threshold=0.05 (significant), or a DEG with threshold=0.001 (very sig.)
# and whether it is up/down regulated
# NOTE: use strings instead of numbers so ggplot2 knows it is discrete
fdrs$significance <- "0"
fdrs$significance[fdrs$FDR < 0.05 & fdrs$logFC > 0] <- "11"
fdrs$significance[fdrs$FDR < 0.05 & fdrs$logFC < 0] <- "12"
fdrs$significance[fdrs$FDR < 0.001 & fdrs$logFC > 0] <- "21"
fdrs$significance[fdrs$FDR < 0.001 & fdrs$logFC < 0] <- "22"

# Add a column with log10(FDR)
fdrs$logFDR <- -log10(fdrs$FDR)


# Build volcano plot
plot <- ggplot2::ggplot(fdrs, ggplot2::aes(x=logFC, y=logFDR, col=significance))

# Add scatter points and color appropriately
plot <- plot + ggplot2::geom_point() + 
  ggplot2::scale_color_manual(
    name = "Significance and\nExpression Change",
    labels=c("FDR >= 0.05", 
             "0.001 <= FDR < 0.05; upregulated", 
             "0.001 <= FDR < 0.05; downregulated", 
             "FDR < 0.001; upregulated",
             "FDR < 0.001; downregulated"),
    values=c("gray", "pink", "turquoise2", "red", "blue"))

# Define styling and significance lines and add them to the plot
plotStyle <- ggplot2::theme(
  plot.title = ggplot2::element_text(hjust = 0.5),
  panel.background = ggplot2::element_rect(fill="white", color="black"),
  panel.grid.major = ggplot2::element_line(colour="#eeeeee"),
  panel.grid.minor = ggplot2::element_line(colour="#eeeeee"))

fdr05Line <- ggplot2::geom_hline(yintercept=-log10(0.05), linetype='dotted', 
                                 col = 'black', size=1)
fdr001Line <- ggplot2::geom_hline(yintercept=-log10(0.001), linetype='dotted', 
                                  col = 'black', size=1)

plot <- plot + plotStyle + fdr05Line + fdr001Line
  
# Add title, axis labels
xLabel <- "Log2 FC"
yLabel <- "-Log10 FDR"
title <- "DEGs in GSE129943 (-logFDR vs. logFC)"
plot <- plot + 
  ggplot2::ggtitle(title) + ggplot2::ylab(yLabel) + ggplot2::xlab(xLabel)
  
plot
A volcano plot showing the expression change (log2FC) and significance of the expression change after BH correction (-log10FDR) for all genes in the cleaned GSE129943 dataset.

Figure 1: A volcano plot showing the expression change (log2FC) and significance of the expression change after BH correction (-log10FDR) for all genes in the cleaned GSE129943 dataset.

The volcano plot in Figure 1 is fairly symmetrical, indicating that there are roughly an equal number of up-regulated and down-regulated genes in MKD cells (relative to normal kidney cells). This plot also visualizes what was already demonstrated numerically: that the number of very significantly (FDR < 0.001) differentially expressed genes is quite large (2486 genes out of 14,241, i.e., an astonishing 17.46%). Moreover, more than half of significantly regulated genes (FDR < 0.05) are very significantly regulated (2486 genes out of 4602 = 54%); this is shown by the proportion of red dots to pink dots on the volcano plot.

There are three genes highlighted in the study that produced this data: MUC1, ATF6, and TMED9 (Dvela-Levitt et al., 2019). MUC1 is the gene responsible for MKD (MKD = Mucin 1 Kidney Disease) (Dvela-Levitt et al., 2019). A frame shift in MUC1 (producing the mutant MUC1-fs allele) causes a misfolded version of the Mucin 1 protein to accumulate in cells (Dvela-Levitt et al., 2019). ATF6 is the gene that encodes Activating Transcription Factor 6. This TF activates one of the three branches of unfolded protein response (UPR), which is a cellular response that helps clear misfolded proteins from the cell and maintain ER homeostasis (Dvela-Levitt et al., 2019). In the original study, the authors compared this RNASeq data with a published transcriptional signature of ATF6-specific UPR activation (Adamson et al., 2016) and concluded that the ATF6 branch is significantly up-regulated (Dvela-Levitt et al., 2019). The final gene of interest is TMED9, which encodes a cargo receptor in the p24 family that is found on vesicles in the early secretory pathway (Dvela-Levitt et al., 2019). According to the Dvela-Levitt et al., TMED9-containing vesicles trap misfolded MUC1 proteins, preventing them from lysosomic degradation; thus, TMED9 plays a key role in the development of MKD (Dvela-Levitt et al., 2019). In the study, Dvela-Levitt et al. performed immuno-fluorescence (IF) microscopy on normal kidney organoids and MKD patient-derived organoids and found that TMED9 was significantly more abundant in patient-derived organoids (Dvela-Levitt et al., 2019).

Because of the critical roles of MUC1, ATF6, and TMED9 in MKD, it is important to visualize these genes in our normalized expression data.

# Get Ensembl gene IDs for each gene of interest
muc1 <- expressionSet$Ensembl[which(expressionSet$HGNC == "MUC1")]
atf6 <- expressionSet$Ensembl[which(expressionSet$HGNC == "ATF6")]
tmed9 <- expressionSet$Ensembl[which(expressionSet$HGNC == "TMED9")]

# Label the genes of interest so that they have their own colour
fdrs$`Genes of Interest` <- "0"
fdrs[muc1, "Genes of Interest"] <- "1"
fdrs[atf6, "Genes of Interest"] <- "2"
fdrs[tmed9, "Genes of Interest"] <- "3"
# Sort so that the colored points are plotted last
fdrs <- fdrs[order(fdrs$`Genes of Interest`, decreasing = F), ]


# Build new volcano plot
plot <- ggplot2::ggplot(fdrs, ggplot2::aes(x=logFC, y=logFDR, 
                                           col=`Genes of Interest`, 
                                           alpha=`Genes of Interest`))

# Add scatter points and color appropriately
plot <- plot + ggplot2::geom_point() + 
  ggplot2::scale_color_manual(labels=c("Other genes", "MUC1", "ATF6", "TMED9"),
                              values=c("gray", "red", "blue", "green")) + 
  ggplot2::scale_alpha_manual(labels=c("Other genes", "MUC1", "ATF6", "TMED9"),
                              values=c(0.1, 1.0, 1.0, 1.0))

# Add threshold lines and styling
plot <- plot + plotStyle + fdr05Line + fdr001Line

# Add title, axis labels
xLabel <- "Log2 FC"
yLabel <- "-Log10 FDR"
title <- "DEGs in GSE129943 (-logFDR vs. logFC)"
plot <- plot + 
  ggplot2::ggtitle(title) + ggplot2::ylab(yLabel) + ggplot2::xlab(xLabel)

plot
A volcano plot showing the expression change (log2FC) and significance of the expression change after BH correction (-log10FDR) of MUC1, ATF6, and TMED9 relative to all other genes in the cleaned GSE129943 dataset.

Figure 2: A volcano plot showing the expression change (log2FC) and significance of the expression change after BH correction (-log10FDR) of MUC1, ATF6, and TMED9 relative to all other genes in the cleaned GSE129943 dataset.

The volcano plot in Figure 2 reveals that each gene of interest had a minimal, insignificant change in expression. To get more insight, we will also look at the unadjusted p-values for each of these genes.

kableExtra::kable_styling(
  knitr::kable(
    qlfAdjusted$table[c(muc1, atf6, tmed9), c("logFC", "PValue", "FDR")],
    caption="Log fold change, p-value, and FDR of MUC1, ATF6, and TMED9.", 
    type="html"),
  "condensed")
Table 4: Log fold change, p-value, and FDR of MUC1, ATF6, and TMED9.
logFC PValue FDR
ENSG00000185499 0.2084175 0.4502066 0.6360508
ENSG00000118217 0.1932326 0.1067418 0.2290702
ENSG00000184840 0.1220374 0.1363541 0.2755526

The unadjusted p-values for each of these genes was greater than 0.1, and therefore don’t pass even the weakest threshold of significance (Table 4). It is unclear whether the RNASeq technology used would differentiate between the MUC1-fs (frame shift) and MUC1-wt (wild type) transcripts. Assuming both types of transcripts get classified as MUC1, then we would not expect a significant difference in MUC1 expression between MKD cells and normal kidney cells (since the MUC1-fs mutation does not affect the transcription of the gene). Therefore, the very high p-value and relatively low log fold change for MUC1 is not surprising.

On the other hand, the low log fold changes for ATF6 and TMED9 are unexpected. Although differential expression analysis alone is not enough to confirm whether the entire ATF6 branch of UPR was up-regulated in MKD cells, one would expect that at least the ATF6 gene would be significantly up-regulated. However, this is not the case, so further investigation via gene set enrichment analysis is required.

For TMED9, the authors of the study made no claim about transcriptional up-regulation; they only observed increased protein abundance in MKD patient-derived organoids (Dvela-Levitt et al., 2019). However, one would expect the TMED9 gene to be significantly up-regulated in MKD cells in order to produce the observed TMED9 protein abundance. Again, further insight is required to determine why the results of this differential gene expression analysis seem to contradict the results of the study.


Heatmaps

Next, we will examine our expression set with a series of heatmaps.

plotHeatmap <- function(counts, title) {
  # Row normalization (scale each gene)
  # NOTE: need to use t() because scale operates on columns
  scaledCounts <- t(scale(t(counts)))
  
  # Define colours
  breaks <- c(min(scaledCounts), 0, max(scaledCounts))
  colors <- circlize::colorRamp2(breaks=breaks, 
                                 colors=c("red", "white", "blue"))
  
  # NOTE: use column_title as the main title
  heatmap <- ComplexHeatmap::Heatmap(
    scaledCounts, col=colors, show_column_names=T, show_row_names=F, 
    show_row_dend=T, show_column_dend=T, column_names_rot=45,
    row_title="Genes", column_title=title,
    row_title_side="left", column_title_side = "top",
    heatmap_legend_param=list(title = "Gene-Scaled Counts"))
  return(heatmap)
}

plotHeatmap(counts, paste0("Heatmap of Normalized and Gene-Scaled\n",
                           "RNASeq Counts in GSE129943 (all genes)"))
A heatmap showing the normalized and gene-scaled expression change of all genes in the cleaned GSE129943 dataset. Each row represents a gene, each column represents a biological sample, and both genes and samples are clustered together based on the expression data.

Figure 3: A heatmap showing the normalized and gene-scaled expression change of all genes in the cleaned GSE129943 dataset. Each row represents a gene, each column represents a biological sample, and both genes and samples are clustered together based on the expression data.

plotHeatmap(counts[rownames(sigResults), ], 
            paste0("Heatmap of Normalized and Gene-Scaled\n",
                   "RNASeq Counts in GSE129943 (FDR < 0.05)"))
plotHeatmap(counts[rownames(verySigResults), ], 
            paste0("Heatmap of Normalized and Gene-Scaled\n",
                   "RNASeq Counts in GSE129943 (FDR < 0.001)"))
Two heatmap showing the normalized and gene-scaled expression change of significant DEGs in the cleaned GSE129943 dataset. The left heatmap includes all DEGs with FDR < 0.05. The right heatmap includes all DEGs with FDR < 0.001. Each row represents a gene, each column represents a biological sample, and both genes and samples are clustered together based on the expression data.Two heatmap showing the normalized and gene-scaled expression change of significant DEGs in the cleaned GSE129943 dataset. The left heatmap includes all DEGs with FDR < 0.05. The right heatmap includes all DEGs with FDR < 0.001. Each row represents a gene, each column represents a biological sample, and both genes and samples are clustered together based on the expression data.

Figure 4: Two heatmap showing the normalized and gene-scaled expression change of significant DEGs in the cleaned GSE129943 dataset. The left heatmap includes all DEGs with FDR < 0.05. The right heatmap includes all DEGs with FDR < 0.001. Each row represents a gene, each column represents a biological sample, and both genes and samples are clustered together based on the expression data.

The first heatmap (Figure 3), has a visible signal differentiating MKD cells and normal kidney cells, despite inlcuding non-diffentially expressed genes. In the column dendogram, we observe that the MKD samples cluster together and the normal kidney samples cluster together, reflecting the strong separation we saw earlier in the MDS plot. When only DEGs are mapped (FDR < 0.05; Figure 4, left), the signal in the heatmap becomes a lot stronger. It looks as though each gene is either up-regulated in all the normal kidney samples and down-regulated in all the MKD samples, or vice-versa. The last heatmap (Figure 4, right), which includes only genes with FDR < 0.001, is very similar to the second. It shows the same strong signal, but has fewer white-colored or pale-colored cells. In other words, the final heatmap only shows genes that are strongly up- or down-regulated.


Over-Representation Analysis

Method and Platform

I will use Fisher’s Exact Test for my over-representation analysis. There are two main reasons why we are using Fisher’s Exact Test instead of a binomial or chi-squared test:

  1. Most publicly available ORA tools use Fisher’s Exact Test for their ORA (e.g., g:Profiler, PANTHER, Enrichr); by contrast, there are very few platforms that use the binomial or chi-squared test. This means that it is more practical and pragmatic
    to use Fisher’s Exact Test for gene set enrichment analysis, because it is already implemented in so many platforms. It also suggests that Fisher’s Exact Test is a better metric for ORA, because it is the method of choice by the researchers and consortia that implemented these ORA tools.
  2. Fisher’s Exact Test is a more exact statistical test for overrepresentation (Kim, 2017). By contrast, the chi-squared test is based on an approximation that requires a sufficiently large sample (Kim, 2017). Therefore, I believe that Fisher’s Exact Test will produce better results in my ORA than the other thresholded tests.

As stated above, there are multiple platforms that implement Fisher’s Exact Test in the context of thresholded gene set enrichment. I will use g:Profiler as my ORA tool because (a) it is frequently updated, with updates occurring approximately every three months, following quarterly releases of Ensembl databases; (b) it contains human gene set annotation from all the major annotation datasets (GO, Reactome, KEGG, WikiPathways), (c) it accepts ENSEMBL gene IDs in its gene list, which is the primary type of identifier I am using; (d) it has an accompanying R package, so I can query their database programatically; and (e) it is the platform I am most familiar with (Kolberg et al., 2020; Raudvere et al., 2019).


Annotation Datasets

To get the most insight into our data and retrieve all possible significant pathways, I will use all the pathway annotation datasets available in g:Profiler. This includes GO Biological Processes “The Gene Ontology Resource” (2021), KEGG pathways Kanehisa et al. (2021), Reactome data (Gillespie et al., 2022), and WikiPathways (Martens et al., 2021). I will not include electronic GO annotations because my thresholded gene lists are sufficiently long and should produce results even without electronic annotations.

The version of these four annotation datasets are listed in Table 5.

annotationData <- c("GO:BP", "KEGG", "REAC", "WP")

# Get version info for all annotation datasets
versions <- gprofiler2::get_version_info(organism = "hsapiens")

# Format into a readable table
versionsDf <- as.data.frame(do.call(rbind, versions$sources[annotationData]))
versionsDf$version <- gsub("\\n", "; ", versionsDf$version)
colnames(versionsDf) <- c("Annotation Dataset Name", "Version")
kableExtra::kable_styling(
  knitr::kable(versionsDf, 
             caption=paste("Version information for GO, KEGG, Reactome, and",
                           "WikiPathways pathway annotations"), type="html"),
  "condensed"
)
Table 5: Version information for GO, KEGG, Reactome, and WikiPathways pathway annotations
Annotation Dataset Name Version
GO:BP biological process annotations: BioMart; classes: releases/2021-12-15
KEGG Kyoto Encyclopedia of Genes and Genomes KEGG FTP Release 2021-12-27
REAC Reactome annotations: BioMart; classes: 2022-1-3
WP WikiPathways 20211210

GO and Reactome annotations rely on BioMart releases, so I will also note the BioMart version used in my over-rerpresentation analysis.

cat(paste("BioMart Version:", versions$biomart, versions$biomart_version))
## BioMart Version: Ensembl 105


ORA Results

First, I define two helper functions to help with my gene set enrichment analysis.

getGeneSets <- function(queryGenes, annotationSources, backgroundGenes=NULL) {
  # Determine whether the background for Fisher's test will be custom
  if (is.null(backgroundGenes)) {
    domainScope <- "annotated"
  } else {
    domainScope <- "custom"
  }
  
  return(
    gprofiler2::gost(query = queryGenes,organism = "hsapiens", 
                     ordered_query = F, multi_query = F, significant = F, 
                     exclude_iea = T, measure_underrepresentation = F, 
                     evcodes = F, user_threshold = 0.05, 
                     correction_method = "fdr", domain_scope = domainScope, 
                     custom_bg = backgroundGenes, sources = annotationSources)
  )
}

getTopGeneSets <- function(oraDf, annotationSource, numSets=10) {
  results <- oraDf[oraDf$source == annotationSource, ]
  topResults <- results[1:numSets, c("source", "term_id", "term_name", 
                                     "term_size", "p_value") ]
  colnames(topResults) <- c("Source", "Term ID", "Term Name", "Term Size", 
                            "P-Value")
  return(topResults)
}

I am choosing a significance threshold of 0.05 because it is the most common threshold to use and because I want to fetch as many gene sets as possible. I am again using the Benjamini-Hochberg procedure to correct for multiple hypothesis testing, for the same reasons described in Multiple Hypothesis Testing.


FDR < 0.05

I will start by performing ORA on significantly regulated genes (DEG threshold: FDR < 0.05).

Up-regulated
bgGenes <- rownames(counts) 

oraFile <- paste0(dataDir, "/upreg_ora_05.rds")
if (!file.exists(oraFile)) {
  # Fetch ORA results from g:Profiler if they aren't saved and save them
  upRegORA <- getGeneSets(upRegGenes, annotationData, bgGenes)
  saveRDS(upRegORA, file=oraFile)
} else {
  # Load ORA results (streamlines so I don't have to query g:Profiler each time)
  upRegORA <- readRDS(oraFile)
}

# Save all ORA results in list
allOras <- list()
allOras[[1]] <- list(result = upRegORA$result, 
                     sig = "FDR < 0.05", expr = "Up-regulated")
oraPlot <- gprofiler2::gostplot(upRegORA, capped=TRUE, interactive=TRUE)
plotly::layout(oraPlot, title = paste('Over-Representation of Gene Sets',
                                       'in Up-Regulated DEG List',
                                       '(Threshold: FDR < 0.05)'),
               xaxis = list(title = 'Annotation Dataset'))

Figure 5: A Manhattan plot showing all gene sets returned from running Fisher’s Exact Test (via g:Profiler) on significantly (FDR < 0.05) up-regulated genes in GSE129943. Colours correspond to different annotation datasets (GO:BP, KEGG, Reactome, WikiPathways). The y-axis shows how significantly over-represented each gene set is (-log10FDR).

The interactive plot in Figure 5 helps us easily identify the most significantly over-represented gene sets in our up-regulated gene list. However, if we look at the top give GO:BP gene sets from our results (Table 6), we notice that they have extremely vague names and each contains more than 3000 genes.

kableExtra::kable_styling(
  knitr::kable(getTopGeneSets(upRegORA$result, "GO:BP", numSets=5), 
               caption=paste("Top five over-represented GO:BP gene sets in ORA",
                             "of up-regulated DEGs with FDR < 0.05"),
               digits=42, type="html"),
  "condensed")
Table 6: Top five over-represented GO:BP gene sets in ORA of up-regulated DEGs with FDR < 0.05
Source Term ID Term Name Term Size P-Value
GO:BP GO:0050896 response to stimulus 4398 1.100000e-41
GO:BP GO:0008150 biological_process 10828 1.330195e-32
GO:BP GO:0048518 positive regulation of biological process 3649 5.069471e-32
GO:BP GO:0051716 cellular response to stimulus 3865 4.612058e-30
GO:BP GO:0032501 multicellular organismal process 3205 1.015804e-29

These “parent” gene sets are not informative, and therefore we will narrow down our analysis to smaller gene sets (containing no more than 200 genes).

showTopSmallGeneSets <- function(ora, annotationSources, numSets=10) {
  smallGeneSets <- ora$result[ora$result$term_size <= 200, ]
  for (dataset in annotationSources) {
    
    geneSetKable <- knitr::kable(
      getTopGeneSets(smallGeneSets, dataset, numSets=numSets), 
      caption=paste("Top", numSets, "over-represented small gene sets from", 
                    dataset, "in ORA of", ora$expr, "DEGs with", ora$sig, ".",
                    "'Small' gene sets are defined as any gene sets with at",
                    "most 200 genes."),
      digits=42, valign='t', row.names=FALSE, format="html")
    
    print(kableExtra::kable_styling(geneSetKable, "condensed"))
    cat('\n <br/> \n')
  }
}

showTopSmallGeneSets(allOras[[1]], annotationData)
Table 7: Top 10 over-represented small gene sets from GO:BP in ORA of Up-regulated DEGs with FDR < 0.05 . ‘Small’ gene sets are defined as any gene sets with at most 200 genes.
Source Term ID Term Name Term Size P-Value
GO:BP GO:0140546 defense response to symbiont 155 1.923044e-14
GO:BP GO:0051607 defense response to virus 155 1.923044e-14
GO:BP GO:1903900 regulation of viral life cycle 102 6.110663e-14
GO:BP GO:0050792 regulation of viral process 119 1.688086e-13
GO:BP GO:0019058 viral life cycle 196 1.103854e-12
GO:BP GO:0048525 negative regulation of viral process 69 2.111753e-11
GO:BP GO:0045071 negative regulation of viral genome replication 45 1.151882e-10
GO:BP GO:0019079 viral genome replication 103 7.242580e-10
GO:BP GO:0045069 regulation of viral genome replication 68 8.470246e-09
GO:BP GO:0002831 regulation of response to biotic stimulus 191 2.789158e-08

Table 7: Top 10 over-represented small gene sets from KEGG in ORA of Up-regulated DEGs with FDR < 0.05 . ‘Small’ gene sets are defined as any gene sets with at most 200 genes.
Source Term ID Term Name Term Size P-Value
KEGG KEGG:05169 Epstein-Barr virus infection 159 2.578725e-10
KEGG KEGG:03050 Proteasome 42 5.764911e-09
KEGG KEGG:05160 Hepatitis C 120 4.486228e-07
KEGG KEGG:05164 Influenza A 122 6.865981e-07
KEGG KEGG:04612 Antigen processing and presentation 43 1.896833e-06
KEGG KEGG:04210 Apoptosis 115 3.147859e-05
KEGG KEGG:05202 Transcriptional misregulation in cancer 126 3.147859e-05
KEGG KEGG:04360 Axon guidance 150 3.309012e-05
KEGG KEGG:04060 Cytokine-cytokine receptor interaction 113 1.027028e-04
KEGG KEGG:04217 Necroptosis 113 1.027028e-04

Table 7: Top 10 over-represented small gene sets from REAC in ORA of Up-regulated DEGs with FDR < 0.05 . ‘Small’ gene sets are defined as any gene sets with at most 200 genes.
Source Term ID Term Name Term Size P-Value
REAC REAC:R-HSA-913531 Interferon Signaling 154 1.913644e-15
REAC REAC:R-HSA-909733 Interferon alpha/beta signaling 53 7.018134e-14
REAC REAC:R-HSA-1236975 Antigen processing-Cross presentation 84 3.766138e-13
REAC REAC:R-HSA-1236974 ER-Phagosome pathway 76 2.323265e-12
REAC REAC:R-HSA-5357801 Programmed Cell Death 188 3.259930e-12
REAC REAC:R-HSA-8852276 The role of GTSE1 in G2/M progression after G2 checkpoint 67 6.759750e-12
REAC REAC:R-HSA-109581 Apoptosis 161 3.895050e-11
REAC REAC:R-HSA-2467813 Separation of Sister Chromatids 174 2.417929e-10
REAC REAC:R-HSA-174178 APC/C:Cdh1 mediated degradation of Cdc20 and other APC/C:Cdh1 targeted proteins in late mitosis/early G1 71 3.350562e-10
REAC REAC:R-HSA-176814 Activation of APC/C and APC/C:Cdc20 mediated degradation of mitotic proteins 74 3.371945e-10

Table 7: Top 10 over-represented small gene sets from WP in ORA of Up-regulated DEGs with FDR < 0.05 . ‘Small’ gene sets are defined as any gene sets with at most 200 genes.
Source Term ID Term Name Term Size P-Value
WP WP:WP183 Proteasome degradation 58 8.444368e-15
WP WP:WP5115 Network map of SARS-CoV-2 signaling pathway 136 1.988345e-12
WP WP:WP5039 SARS-CoV-2 innate immunity evasion and cell-specific immune response 50 3.512453e-07
WP WP:WP619 Type II interferon signaling (IFNG) 25 7.417131e-07
WP WP:WP254 Apoptosis 76 2.679573e-05
WP WP:WP2446 Retinoblastoma gene in cancer 87 2.679573e-05
WP WP:WP4197 Immune response to tuberculosis 22 2.679573e-05
WP WP:WP5083 Neuroinflammation and glutamatergic signaling 99 3.693096e-05
WP WP:WP2840 Hair follicle development: cytodifferentiation - part 3 of 3 46 4.883351e-05
WP WP:WP4304 Oligodendrocyte specification and differentiation, leading to myelin components for CNS 15 2.008485e-04


Table 7 shows the ten most significantly over-represented gene sets from each annotation source, based on the up-regulated gene list with threshold 0.05. For each annotation dataset, the top gene sets were related to viral response, interferon signaling, cell cycle, and apoptosis. While these were not mentioned in the study, it is believable that kidney cells might react similarly to a toxic proteinopathy as they do to viral infection (viral response generally involves changes in the cell cycle, apoptosis, and interferon signaling).


Down-regulated
oraFile <- paste0(dataDir, "/downreg_ora_05.rds")
if (!file.exists(oraFile)) {
  downRegORA <- getGeneSets(downRegGenes, annotationData, bgGenes)
  saveRDS(downRegORA, file=oraFile)
} else {
  downRegORA <- readRDS(oraFile)
}

allOras[[2]] <- list(result = downRegORA$result, 
                     sig = "FDR < 0.05", expr = "Down-regulated")
oraPlot <- gprofiler2::gostplot(downRegORA, capped=TRUE, interactive=TRUE)
plotly::layout(oraPlot, title = paste('Over-Representation of Gene Sets',
                                       'in Down-Regulated DEG List',
                                       '(Threshold: FDR < 0.05)'),
               xaxis = list(title = 'Annotation Dataset'), 
               yaxis = list(title = '-log10(FDR) (capped at 10^-16)'))

Figure 6: A Manhattan plot showing all gene sets returned from running Fisher’s Exact Test (via g:Profiler) on significantly (FDR < 0.05) down-regulated genes in GSE129943. Colours correspond to different annotation datasets (GO:BP, KEGG, Reactome, WikiPathways). The y-axis shows how significantly over-represented each gene set is (-log10FDR).

Table 8: Ten most significantly over-represented gene sets from each annotation source, based on the down-regulated gene list with threshold 0.05
showTopSmallGeneSets(allOras[[2]], annotationData)
Table 8: Top 10 over-represented small gene sets from GO:BP in ORA of Down-regulated DEGs with FDR < 0.05 . ‘Small’ gene sets are defined as any gene sets with at most 200 genes.
Source Term ID Term Name Term Size P-Value
GO:BP GO:0002181 cytoplasmic translation 130 1.259901e-24
GO:BP GO:0036293 response to decreased oxygen levels 138 2.403078e-09
GO:BP GO:0001666 response to hypoxia 135 3.103712e-09
GO:BP GO:0070482 response to oxygen levels 151 8.312130e-09
GO:BP GO:0036294 cellular response to decreased oxygen levels 88 1.039810e-06
GO:BP GO:0007610 behavior 130 1.291760e-06
GO:BP GO:0071456 cellular response to hypoxia 85 1.346778e-06
GO:BP GO:0071453 cellular response to oxygen levels 100 1.077212e-05
GO:BP GO:0045229 external encapsulating structure organization 140 1.816234e-04
GO:BP GO:0051960 regulation of nervous system development 186 2.779549e-04

Table 8: Top 10 over-represented small gene sets from KEGG in ORA of Down-regulated DEGs with FDR < 0.05 . ‘Small’ gene sets are defined as any gene sets with at most 200 genes.
Source Term ID Term Name Term Size P-Value
KEGG KEGG:03010 Ribosome 127 7.783092e-21
KEGG KEGG:05171 Coronavirus disease - COVID-19 170 3.064324e-17
KEGG KEGG:04066 HIF-1 signaling pathway 83 5.906660e-04
KEGG KEGG:04974 Protein digestion and absorption 46 5.906660e-04
KEGG KEGG:04140 Autophagy - animal 132 9.272957e-04
KEGG KEGG:00010 Glycolysis / Gluconeogenesis 43 2.136640e-03
KEGG KEGG:05230 Central carbon metabolism in cancer 55 2.136640e-03
KEGG KEGG:04150 mTOR signaling pathway 132 7.217849e-03
KEGG KEGG:01230 Biosynthesis of amino acids 56 7.217849e-03
KEGG KEGG:05211 Renal cell carcinoma 66 8.472398e-03

Table 8: Top 10 over-represented small gene sets from REAC in ORA of Down-regulated DEGs with FDR < 0.05 . ‘Small’ gene sets are defined as any gene sets with at most 200 genes.
Source Term ID Term Name Term Size P-Value
REAC REAC:R-HSA-156842 Eukaryotic Translation Elongation 89 8.519000e-39
REAC REAC:R-HSA-156902 Peptide chain elongation 85 8.519000e-39
REAC REAC:R-HSA-192823 Viral mRNA Translation 85 1.307770e-37
REAC REAC:R-HSA-72689 Formation of a pool of free 40S subunits 97 1.615670e-37
REAC REAC:R-HSA-156827 L13a-mediated translational silencing of Ceruloplasmin expression 107 1.621230e-37
REAC REAC:R-HSA-72706 GTP hydrolysis and joining of the 60S ribosomal subunit 108 5.528209e-36
REAC REAC:R-HSA-9633012 Response of EIF2AK4 (GCN2) to amino acid deficiency 97 2.166951e-35
REAC REAC:R-HSA-72613 Eukaryotic Translation Initiation 115 2.166951e-35
REAC REAC:R-HSA-72737 Cap-dependent Translation Initiation 115 2.166951e-35
REAC REAC:R-HSA-975956 Nonsense Mediated Decay (NMD) independent of the Exon Junction Complex (EJC) 91 7.906436e-35

Table 8: Top 10 over-represented small gene sets from WP in ORA of Down-regulated DEGs with FDR < 0.05 . ‘Small’ gene sets are defined as any gene sets with at most 200 genes.
Source Term ID Term Name Term Size P-Value
WP WP:WP477 Cytoplasmic ribosomal proteins 85 5.919000e-39
WP WP:WP4018 Clear cell renal cell carcinoma pathways 72 2.081253e-10
WP WP:WP1946 Cori cycle 13 4.594145e-06
WP WP:WP4629 Aerobic glycolysis 11 4.594145e-06
WP WP:WP534 Glycolysis and gluconeogenesis 33 8.763435e-06
WP WP:WP3614 Photodynamic therapy-induced HIF-1 survival signaling 32 1.608166e-04
WP WP:WP4290 Metabolic reprogramming in colon cancer 40 4.243684e-03
WP WP:WP107 Translation factors 49 5.212685e-03
WP WP:WP5094 Orexin receptor pathway 96 7.943289e-03
WP WP:WP4549 Fragile X syndrome 97 8.900015e-03



The common terms in the gene sets listed in Table 8 are translation, elongation, ribosomes, autophagy, glucogenesis and synthesis of nucleotide sugars. Again, it is logical that a cell affected by the accumulation of a misfolded protein would down-regulate its protein synthesis (ribosomal assembly, peptide translation and elongation, amino acid biosynthesis), in order to prevent further aggregation. Additionally, the down-regulation of autophagy is consistent with the study’s observations that MKD cells seem unable to clear the misfolded Muc1 proteins through lysosomal degradation. The other top terms from the annotation datasets are less cohesive: among them are response to hypoxia, demethylation, HIG-1 signaling, and carcinoma pathways.


All DEGs
oraFile <- paste0(dataDir, "/alldegs_ora_05.rds")
if (!file.exists(oraFile)) {
  allGenes <- c(upRegGenes, downRegGenes)
  sigORA <- getGeneSets(allGenes, annotationData, bgGenes)
  saveRDS(sigORA, file=oraFile)
} else {
  sigORA <- readRDS(oraFile)
}

allOras[[3]] <- list(result = sigORA$result, 
                     sig = "FDR < 0.05", expr = "All DEGs")
oraPlot <- gprofiler2::gostplot(sigORA, capped=TRUE, interactive=TRUE)
plotly::layout(oraPlot, title = paste('Over-Representation of Gene Sets',
                                       'in DEG List (Threshold: FDR < 0.05)'),
               xaxis = list(title = 'Annotation Dataset'))

Figure 7: A Manhattan plot showing all gene sets returned from running Fisher’s Exact Test (via g:Profiler) on all significant (FDR < 0.05) DEGs in GSE129943. Colours correspond to different annotation datasets (GO:BP, KEGG, Reactome, WikiPathways). The y-axis shows how significantly over-represented each gene set is (-log10FDR).

Table 9: Ten most significantly over-represented gene sets from each annotation source, based on the DEG list with threshold 0.05
showTopSmallGeneSets(allOras[[3]], annotationData)
Table 9: Top 10 over-represented small gene sets from GO:BP in ORA of All DEGs DEGs with FDR < 0.05 . ‘Small’ gene sets are defined as any gene sets with at most 200 genes.
Source Term ID Term Name Term Size P-Value
GO:BP GO:0002181 cytoplasmic translation 130 2.755235e-10
GO:BP GO:0045229 external encapsulating structure organization 140 1.711119e-09
GO:BP GO:0030198 extracellular matrix organization 138 5.489826e-09
GO:BP GO:0043062 extracellular structure organization 138 5.489826e-09
GO:BP GO:0042692 muscle cell differentiation 146 2.808452e-08
GO:BP GO:0140546 defense response to symbiont 155 2.808452e-08
GO:BP GO:0051607 defense response to virus 155 2.808452e-08
GO:BP GO:0007411 axon guidance 116 4.370059e-08
GO:BP GO:0097485 neuron projection guidance 116 4.370059e-08
GO:BP GO:0044057 regulation of system process 197 8.110596e-08

Table 9: Top 10 over-represented small gene sets from KEGG in ORA of All DEGs DEGs with FDR < 0.05 . ‘Small’ gene sets are defined as any gene sets with at most 200 genes.
Source Term ID Term Name Term Size P-Value
KEGG KEGG:05171 Coronavirus disease - COVID-19 170 9.549956e-17
KEGG KEGG:04512 ECM-receptor interaction 63 2.850300e-06
KEGG KEGG:05412 Arrhythmogenic right ventricular cardiomyopathy 51 2.850300e-06
KEGG KEGG:03010 Ribosome 127 3.749980e-06
KEGG KEGG:05410 Hypertrophic cardiomyopathy 57 8.562737e-06
KEGG KEGG:05206 MicroRNAs in cancer 142 1.022466e-05
KEGG KEGG:04514 Cell adhesion molecules 79 1.022466e-05
KEGG KEGG:05202 Transcriptional misregulation in cancer 126 1.022466e-05
KEGG KEGG:04974 Protein digestion and absorption 46 1.047406e-05
KEGG KEGG:05169 Epstein-Barr virus infection 159 1.676900e-05

Table 9: Top 10 over-represented small gene sets from REAC in ORA of All DEGs DEGs with FDR < 0.05 . ‘Small’ gene sets are defined as any gene sets with at most 200 genes.
Source Term ID Term Name Term Size P-Value
REAC REAC:R-HSA-156902 Peptide chain elongation 85 9.913603e-18
REAC REAC:R-HSA-156842 Eukaryotic Translation Elongation 89 1.648108e-17
REAC REAC:R-HSA-192823 Viral mRNA Translation 85 6.160864e-17
REAC REAC:R-HSA-1474244 Extracellular matrix organization 196 6.160864e-17
REAC REAC:R-HSA-9010553 Regulation of expression of SLITs and ROBOs 156 1.132525e-16
REAC REAC:R-HSA-376176 Signaling by ROBO receptors 194 6.919092e-16
REAC REAC:R-HSA-72689 Formation of a pool of free 40S subunits 97 1.264974e-15
REAC REAC:R-HSA-72764 Eukaryotic Translation Termination 89 3.968572e-15
REAC REAC:R-HSA-1799339 SRP-dependent cotranslational protein targeting to membrane 108 4.400929e-15
REAC REAC:R-HSA-975956 Nonsense Mediated Decay (NMD) independent of the Exon Junction Complex (EJC) 91 4.400929e-15

Table 9: Top 10 over-represented small gene sets from WP in ORA of All DEGs DEGs with FDR < 0.05 . ‘Small’ gene sets are defined as any gene sets with at most 200 genes.
Source Term ID Term Name Term Size P-Value
WP WP:WP477 Cytoplasmic ribosomal proteins 85 1.387354e-18
WP WP:WP5115 Network map of SARS-CoV-2 signaling pathway 136 5.212673e-09
WP WP:WP5055 Burn wound healing 73 3.682437e-08
WP WP:WP183 Proteasome degradation 58 5.418779e-08
WP WP:WP2328 Allograft Rejection 40 5.549131e-07
WP WP:WP4018 Clear cell renal cell carcinoma pathways 72 3.321048e-06
WP WP:WP2877 Vitamin D receptor pathway 107 1.018915e-05
WP WP:WP2118 Arrhythmogenic right ventricular cardiomyopathy 50 1.591225e-05
WP WP:WP2431 Spinal cord injury 79 3.381740e-05
WP WP:WP5083 Neuroinflammation and glutamatergic signaling 99 3.381740e-05



Even though there are more significantly up-regulated genes than there are down-regulated genes in our dataset (2560 vs. 2042), it seems that the down-regulated pathways (specifically, ribosomal functions, translation, elongation) were more significantly over-represented in all the annotation datasets than the up-regulated pathways (Table 9).

getNumSigPathways <- function(ora) {
  return(length(which(ora$p_value < 0.05)))
}

compareNumGeneSets <- function(oras, fdrs, logFCs, caption) {
  numPathways <- unlist(lapply(oras, getNumSigPathways))
  
  exprChanges <- unlist(lapply(logFCs, function(foldChange) {
    if (foldChange == "0") {
      return("All DEGs")
    } else if (foldChange == "+") {
      return("Up-regulated")
    } else if (foldChange == "-") {
      return("Down-regulated")
    }
  }))
  
  gsKable <- knitr::kable(
    data.frame(`DEG Sig. Threshold` = fdrs, `Expression Change` = exprChanges,
               `Num. Gene Sets` = numPathways,  check.names = FALSE), 
    caption=caption, type="html")
  kableExtra::kable_styling(gsKable, "condensed")
}

oras <- list(sigORA$result, upRegORA$result, downRegORA$result)
compareNumGeneSets(oras, fdrs=rep("FDR < 0.05", 3), logFCs=c("0", "+", "-"),
                   caption = paste("Number of significantly over-represented",
                                   "gene sets in gene lists with threshold",
                                   "FDR < 0.05."))
Table 10: Number of significantly over-represented gene sets in gene lists with threshold FDR < 0.05.
DEG Sig. Threshold Expression Change Num. Gene Sets
FDR < 0.05 All DEGs 1773
FDR < 0.05 Up-regulated 1488
FDR < 0.05 Down-regulated 449

As expected, the cumulative ORA (all significantly regulated genes) returned more gene sets than either of the individual ORAs (with only up- or down- regulated genes) (Table 10). However, the cumulative ORA returned fewer gene sets than the sum of the gene sets from the individual ORAs. The ORA on only down-regulated genes returned far fewer gene sets than the ORA with only up-regulated genes (~30% as many gene sets), even though the down-regulated gene list was almost 80% the length of the up-regulated gene list.


FDR < 0.001

Now that I have a big picture view of the enriched pathways in MKD cells, I will use the gene lists generated by the stricter threshold (FDR < 0.001) to identify the most important and significantly regulated pathways.

Up-regulated
oraFile <- paste0(dataDir, "/upreg_ora_001.rds")
if (!file.exists(oraFile)) {
  veryUpRegORA <- getGeneSets(veryUpRegGenes, annotationData, bgGenes)
  saveRDS(veryUpRegORA, file=oraFile)
} else {
  veryUpRegORA <- readRDS(oraFile)
}

allOras[[4]] <- list(result = veryUpRegORA$result, 
                     sig = "FDR < 0.001", expr = "Up-regulated")
oraPlot <- gprofiler2::gostplot(veryUpRegORA, capped=TRUE, interactive=TRUE)
plotly::layout(oraPlot, title = paste('Over-Representation of Gene Sets',
                                       'in Up-Regulated DEG List',
                                       '(Threshold: FDR < 0.001)'),
               xaxis = list(title = 'Annotation Dataset'))

Figure 8: A Manhattan plot showing all gene sets returned from running Fisher’s Exact Test (via g:Profiler) on all very significantly (FDR < 0.001) up-regulated genes in GSE129943. Colours correspond to different annotation datasets (GO:BP, KEGG, Reactome, WikiPathways). The y-axis shows how significantly over-represented each gene set is (-log10FDR).

Table 11: Ten most significantly over-represented gene sets from each annotation source, based on the up-regulated gene list with threshold 0.001
showTopSmallGeneSets(allOras[[4]], annotationData)
Table 11: Top 10 over-represented small gene sets from GO:BP in ORA of Up-regulated DEGs with FDR < 0.001 . ‘Small’ gene sets are defined as any gene sets with at most 200 genes.
Source Term ID Term Name Term Size P-Value
GO:BP GO:0051607 defense response to virus 155 3.521539e-24
GO:BP GO:0140546 defense response to symbiont 155 3.521539e-24
GO:BP GO:0048525 negative regulation of viral process 69 1.359792e-18
GO:BP GO:1903900 regulation of viral life cycle 102 5.108345e-18
GO:BP GO:0050792 regulation of viral process 119 5.165447e-18
GO:BP GO:0045071 negative regulation of viral genome replication 45 1.085921e-16
GO:BP GO:0019058 viral life cycle 196 3.041602e-15
GO:BP GO:0002683 negative regulation of immune system process 181 6.477933e-14
GO:BP GO:0045069 regulation of viral genome replication 68 3.921574e-13
GO:BP GO:0019079 viral genome replication 103 2.720715e-12

Table 11: Top 10 over-represented small gene sets from KEGG in ORA of Up-regulated DEGs with FDR < 0.001 . ‘Small’ gene sets are defined as any gene sets with at most 200 genes.
Source Term ID Term Name Term Size P-Value
KEGG KEGG:05169 Epstein-Barr virus infection 159 1.189207e-10
KEGG KEGG:05164 Influenza A 122 3.553557e-10
KEGG KEGG:05160 Hepatitis C 120 8.128932e-07
KEGG KEGG:05162 Measles 104 2.176205e-06
KEGG KEGG:04612 Antigen processing and presentation 43 2.176205e-06
KEGG KEGG:03050 Proteasome 42 7.346302e-06
KEGG KEGG:05167 Kaposi sarcoma-associated herpesvirus infection 149 3.085640e-05
KEGG KEGG:05170 Human immunodeficiency virus 1 infection 167 1.531489e-04
KEGG KEGG:04514 Cell adhesion molecules 79 1.531489e-04
KEGG KEGG:05418 Fluid shear stress and atherosclerosis 111 2.023412e-04

Table 11: Top 10 over-represented small gene sets from REAC in ORA of Up-regulated DEGs with FDR < 0.001 . ‘Small’ gene sets are defined as any gene sets with at most 200 genes.
Source Term ID Term Name Term Size P-Value
REAC REAC:R-HSA-913531 Interferon Signaling 154 4.041223e-24
REAC REAC:R-HSA-909733 Interferon alpha/beta signaling 53 1.205883e-21
REAC REAC:R-HSA-877300 Interferon gamma signaling 64 1.173120e-13
REAC REAC:R-HSA-1236974 ER-Phagosome pathway 76 3.518394e-11
REAC REAC:R-HSA-1236975 Antigen processing-Cross presentation 84 1.403507e-10
REAC REAC:R-HSA-381426 Regulation of Insulin-like Growth Factor (IGF) transport and uptake by Insulin-like Growth Factor Binding Proteins (IGFBPs) 78 2.634256e-09
REAC REAC:R-HSA-8957275 Post-translational protein phosphorylation 73 1.208023e-08
REAC REAC:R-HSA-5357801 Programmed Cell Death 188 1.760269e-08
REAC REAC:R-HSA-8852276 The role of GTSE1 in G2/M progression after G2 checkpoint 67 3.157421e-08
REAC REAC:R-HSA-109581 Apoptosis 161 8.001187e-08

Table 11: Top 10 over-represented small gene sets from WP in ORA of Up-regulated DEGs with FDR < 0.001 . ‘Small’ gene sets are defined as any gene sets with at most 200 genes.
Source Term ID Term Name Term Size P-Value
WP WP:WP5115 Network map of SARS-CoV-2 signaling pathway 136 1.258866e-12
WP WP:WP183 Proteasome degradation 58 2.042227e-12
WP WP:WP619 Type II interferon signaling (IFNG) 25 5.861221e-10
WP WP:WP4197 Immune response to tuberculosis 22 8.494371e-09
WP WP:WP4868 Type I interferon induction and signaling during SARS-CoV-2 infection 26 2.181967e-07
WP WP:WP5039 SARS-CoV-2 innate immunity evasion and cell-specific immune response 50 2.450495e-07
WP WP:WP2840 Hair follicle development: cytodifferentiation - part 3 of 3 46 7.812426e-06
WP WP:WP4880 Host-pathogen interaction of human coronaviruses - interferon induction 30 1.301529e-05
WP WP:WP5083 Neuroinflammation and glutamatergic signaling 99 3.412772e-05
WP WP:WP4217 Ebola virus pathway in host 107 1.629663e-04



The top gene sets returned by this analysis, listed in Table 11, were almost exactly identical to the top gene sets returned by the ORA with significantly (FDR < 0.05) up-regulated genes (Table 7). The only difference is that the p-value associated with each returned gene set is a lot smaller (and more significant) in this analysis than the p-value from the previous (FDR < 0.05) analysis. This is probably because the query gene list is smaller, and so the gene sets are even more over-represented.

oras <- list(upRegORA$result, veryUpRegORA$result)
compareNumGeneSets(oras, fdrs=c("FDR < 0.05", "FDR < 0.001"), 
                   logFCs=c("+", "+"), 
                   caption = paste("Number of significantly over-represented",
                                   "gene sets in up-regulated gene lists with",
                                   "different thresholds."))
Table 12: Number of significantly over-represented gene sets in up-regulated gene lists with different thresholds.
DEG Sig. Threshold Expression Change Num. Gene Sets
FDR < 0.05 Up-regulated 1488
FDR < 0.001 Up-regulated 1570

Table 12 shows that the very significantly (FDR < 0.001) up-regulated gene list returned 82 more significantly expressed gene sets than the up-regulated gene list with FDR < 0.05.


Down-regulated
oraFile <- paste0(dataDir, "/downreg_ora_001.rds")
if (!file.exists(oraFile)) {
  veryDownRegORA <- getGeneSets(veryDownRegGenes, annotationData, bgGenes)
  saveRDS(veryDownRegORA, file=oraFile)
} else {
  veryDownRegORA <- readRDS(oraFile)
}

allOras[[5]] <- list(result = veryDownRegORA$result, 
                     sig = "FDR < 0.001", expr = "Down-regulated")
oraPlot <- 
  gprofiler2::gostplot(veryDownRegORA, capped=TRUE, interactive=TRUE)
plotly::layout(oraPlot, title = paste('Over-Representation of Gene Sets',
                                       'in Down-Regulated DEG List',
                                       '(Threshold: FDR < 0.001)'),
               xaxis = list(title = 'Annotation Dataset'))

Figure 9: A Manhattan plot showing all gene sets returned from running Fisher’s Exact Test (via g:Profiler) on all very significantly (FDR < 0.001) down-regulated genes in GSE129943. Colours correspond to different annotation datasets (GO:BP, KEGG, Reactome, WikiPathways). The y-axis shows how significantly over-represented each gene set is (-log10FDR).

Table 13: Ten most significantly over-represented gene sets from each annotation source, based on the down-regulated gene list with threshold 0.001
showTopSmallGeneSets(allOras[[5]], annotationData)
Table 13: Top 10 over-represented small gene sets from GO:BP in ORA of Down-regulated DEGs with FDR < 0.001 . ‘Small’ gene sets are defined as any gene sets with at most 200 genes.
Source Term ID Term Name Term Size P-Value
GO:BP GO:0001666 response to hypoxia 135 4.774269e-11
GO:BP GO:0036293 response to decreased oxygen levels 138 9.364504e-11
GO:BP GO:0070482 response to oxygen levels 151 4.529153e-10
GO:BP GO:0002181 cytoplasmic translation 130 6.159332e-08
GO:BP GO:0071456 cellular response to hypoxia 85 1.885910e-07
GO:BP GO:0098742 cell-cell adhesion via plasma-membrane adhesion molecules 99 2.976675e-07
GO:BP GO:0036294 cellular response to decreased oxygen levels 88 3.688665e-07
GO:BP GO:0071453 cellular response to oxygen levels 100 1.182984e-06
GO:BP GO:0035725 sodium ion transmembrane transport 71 2.953592e-05
GO:BP GO:0045229 external encapsulating structure organization 140 8.088140e-05

Table 13: Top 10 over-represented small gene sets from KEGG in ORA of Down-regulated DEGs with FDR < 0.001 . ‘Small’ gene sets are defined as any gene sets with at most 200 genes.
Source Term ID Term Name Term Size P-Value
KEGG KEGG:03010 Ribosome 127 2.047333e-06
KEGG KEGG:05171 Coronavirus disease - COVID-19 170 8.714660e-06
KEGG KEGG:04066 HIF-1 signaling pathway 83 2.960316e-04
KEGG KEGG:00010 Glycolysis / Gluconeogenesis 43 3.638603e-03
KEGG KEGG:04512 ECM-receptor interaction 63 3.638603e-03
KEGG KEGG:04974 Protein digestion and absorption 46 6.363647e-03
KEGG KEGG:00500 Starch and sucrose metabolism 20 1.972024e-02
KEGG KEGG:04020 Calcium signaling pathway 119 2.407286e-02
KEGG KEGG:01250 Biosynthesis of nucleotide sugars 34 2.407286e-02
KEGG KEGG:04972 Pancreatic secretion 47 2.407286e-02

Table 13: Top 10 over-represented small gene sets from REAC in ORA of Down-regulated DEGs with FDR < 0.001 . ‘Small’ gene sets are defined as any gene sets with at most 200 genes.
Source Term ID Term Name Term Size P-Value
REAC REAC:R-HSA-156827 L13a-mediated translational silencing of Ceruloplasmin expression 107 1.024888e-12
REAC REAC:R-HSA-156842 Eukaryotic Translation Elongation 89 1.024888e-12
REAC REAC:R-HSA-72689 Formation of a pool of free 40S subunits 97 1.024888e-12
REAC REAC:R-HSA-72737 Cap-dependent Translation Initiation 115 1.094750e-12
REAC REAC:R-HSA-72613 Eukaryotic Translation Initiation 115 1.094750e-12
REAC REAC:R-HSA-156902 Peptide chain elongation 85 1.711348e-12
REAC REAC:R-HSA-192823 Viral mRNA Translation 85 1.711348e-12
REAC REAC:R-HSA-72706 GTP hydrolysis and joining of the 60S ribosomal subunit 108 2.660910e-12
REAC REAC:R-HSA-975956 Nonsense Mediated Decay (NMD) independent of the Exon Junction Complex (EJC) 91 1.247020e-11
REAC REAC:R-HSA-9633012 Response of EIF2AK4 (GCN2) to amino acid deficiency 97 1.366156e-11

Table 13: Top 10 over-represented small gene sets from WP in ORA of Down-regulated DEGs with FDR < 0.001 . ‘Small’ gene sets are defined as any gene sets with at most 200 genes.
Source Term ID Term Name Term Size P-Value
WP WP:WP477 Cytoplasmic ribosomal proteins 85 2.402055e-11
WP WP:WP4018 Clear cell renal cell carcinoma pathways 72 6.477499e-08
WP WP:WP4629 Aerobic glycolysis 11 2.137217e-05
WP WP:WP3614 Photodynamic therapy-induced HIF-1 survival signaling 32 2.987092e-05
WP WP:WP1946 Cori cycle 13 9.643910e-05
WP WP:WP534 Glycolysis and gluconeogenesis 33 2.157779e-04
WP WP:WP5097 CCL18 signaling pathway 40 1.631661e-03
WP WP:WP4786 Type I collagen synthesis in the context of osteogenesis imperfecta 28 7.995232e-03
WP WP:WP5055 Burn wound healing 73 1.463845e-02
WP WP:WP4707 Aspirin and miRNAs 10 1.612661e-02



compareNumGeneSets(list(downRegORA$result, veryDownRegORA$result), 
                   fdrs=c("FDR < 0.05", "FDR < 0.001"), 
                   logFCs=c("-", "-"), 
                   caption = paste("Number of significantly over-represented",
                                   "gene sets in down-regulated gene lists with",
                                   "different thresholds."))
Table 14: Number of significantly over-represented gene sets in down-regulated gene lists with different thresholds.
DEG Sig. Threshold Expression Change Num. Gene Sets
FDR < 0.05 Down-regulated 449
FDR < 0.001 Down-regulated 542

As with the up-regulated ORAs, the gene sets returned in this analysis (listed in Table 13) are almost exactly the same and in nearly the same order as gene sets from the ORA with significantly (FDR < 0.05) down-regulated genes (Table 8). Surprisingly, the p-values were bigger (less significant) in the ORA produced by the smaller gene list (stricter threshold). This suggests that the less significant DEGs were important in the down-regulated gene set enrichment analysis, since removing them resulted in less significant results.

The ORA from the stricter-thresholded down-regulated gene list (FDR < 0.001) resulted in 93 more significant gene sets than the weaker-thresholded gene list (FDR < 0.05) (Table 14).


All DEGs
# Load or fetch ORA results for all very sig DEGs
oraFile <- paste0(dataDir, "/alldegs_ora_001.rds")
if (!file.exists(oraFile)) {
  allGenes <- c(veryUpRegGenes, veryDownRegGenes)
  verySigORA <- getGeneSets(allGenes, annotationData, bgGenes)
  saveRDS(verySigORA, file=oraFile)
} else {
  verySigORA <- readRDS(oraFile)
}

# Save to list of ORAs
allOras[[6]] <- list(result = verySigORA$result, 
                     sig = "FDR < 0.001", expr = "All")

oraPlot <- gprofiler2::gostplot(verySigORA, capped=TRUE, interactive=TRUE)
plotly::layout(oraPlot, title = paste('Over-Representation of Gene Sets',
                                       'in DEG List (Threshold: FDR < 0.001)'),
               xaxis = list(title = 'Annotation Dataset'), 
               yaxis = list(title = '-log10(FDR) (capped at 10^-16)'))

Figure 10: A Manhattan plot showing all gene sets returned from running Fisher’s Exact Test (via g:Profiler) on all very significant (FDR < 0.001) DEGs in GSE129943. Colours correspond to different annotation datasets (GO:BP, KEGG, Reactome, WikiPathways). The y-axis shows how significantly over-represented each gene set is (-log10FDR).

Table 15: Ten most significantly over-represented gene sets from each annotation source, based on the DEG list with threshold 0.001
showTopSmallGeneSets(allOras[[6]], annotationData)
Table 15: Top 10 over-represented small gene sets from GO:BP in ORA of All DEGs with FDR < 0.001 . ‘Small’ gene sets are defined as any gene sets with at most 200 genes.
Source Term ID Term Name Term Size P-Value
GO:BP GO:0051607 defense response to virus 155 2.768761e-15
GO:BP GO:0140546 defense response to symbiont 155 2.768761e-15
GO:BP GO:0002683 negative regulation of immune system process 181 4.123843e-12
GO:BP GO:0098742 cell-cell adhesion via plasma-membrane adhesion molecules 99 7.209449e-12
GO:BP GO:0050792 regulation of viral process 119 1.112942e-11
GO:BP GO:1903900 regulation of viral life cycle 102 2.987380e-11
GO:BP GO:0048525 negative regulation of viral process 69 3.196549e-11
GO:BP GO:0045071 negative regulation of viral genome replication 45 3.746031e-11
GO:BP GO:0007162 negative regulation of cell adhesion 159 3.130204e-10
GO:BP GO:0019058 viral life cycle 196 3.666078e-10

Table 15: Top 10 over-represented small gene sets from KEGG in ORA of All DEGs with FDR < 0.001 . ‘Small’ gene sets are defined as any gene sets with at most 200 genes.
Source Term ID Term Name Term Size P-Value
KEGG KEGG:05171 Coronavirus disease - COVID-19 170 2.704195e-10
KEGG KEGG:04512 ECM-receptor interaction 63 4.614595e-08
KEGG KEGG:05169 Epstein-Barr virus infection 159 2.597519e-06
KEGG KEGG:05202 Transcriptional misregulation in cancer 126 5.516569e-06
KEGG KEGG:04514 Cell adhesion molecules 79 5.516569e-06
KEGG KEGG:05164 Influenza A 122 5.516569e-06
KEGG KEGG:04510 Focal adhesion 163 7.978976e-06
KEGG KEGG:04974 Protein digestion and absorption 46 1.865757e-05
KEGG KEGG:05418 Fluid shear stress and atherosclerosis 111 2.724035e-05
KEGG KEGG:04020 Calcium signaling pathway 119 2.775534e-05

Table 15: Top 10 over-represented small gene sets from REAC in ORA of All DEGs with FDR < 0.001 . ‘Small’ gene sets are defined as any gene sets with at most 200 genes.
Source Term ID Term Name Term Size P-Value
REAC REAC:R-HSA-1474244 Extracellular matrix organization 196 7.676585e-20
REAC REAC:R-HSA-913531 Interferon Signaling 154 6.069943e-15
REAC REAC:R-HSA-909733 Interferon alpha/beta signaling 53 2.470923e-13
REAC REAC:R-HSA-381426 Regulation of Insulin-like Growth Factor (IGF) transport and uptake by Insulin-like Growth Factor Binding Proteins (IGFBPs) 78 1.659044e-11
REAC REAC:R-HSA-8957275 Post-translational protein phosphorylation 73 1.220846e-10
REAC REAC:R-HSA-1474290 Collagen formation 64 2.862371e-09
REAC REAC:R-HSA-877300 Interferon gamma signaling 64 2.862371e-09
REAC REAC:R-HSA-2022090 Assembly of collagen fibrils and other multimeric structures 44 1.350811e-08
REAC REAC:R-HSA-1236974 ER-Phagosome pathway 76 2.316021e-07
REAC REAC:R-HSA-6785807 Interleukin-4 and Interleukin-13 signaling 70 2.684536e-07

Table 15: Top 10 over-represented small gene sets from WP in ORA of All DEGs with FDR < 0.001 . ‘Small’ gene sets are defined as any gene sets with at most 200 genes.
Source Term ID Term Name Term Size P-Value
WP WP:WP5115 Network map of SARS-CoV-2 signaling pathway 136 1.695931e-09
WP WP:WP5055 Burn wound healing 73 6.377130e-08
WP WP:WP183 Proteasome degradation 58 6.539560e-08
WP WP:WP619 Type II interferon signaling (IFNG) 25 2.597927e-07
WP WP:WP2911 miRNA targets in ECM and membrane receptors 18 2.704356e-06
WP WP:WP4018 Clear cell renal cell carcinoma pathways 72 7.120623e-06
WP WP:WP3614 Photodynamic therapy-induced HIF-1 survival signaling 32 8.286784e-06
WP WP:WP4197 Immune response to tuberculosis 22 1.187142e-05
WP WP:WP306 Focal adhesion 160 1.214604e-05
WP WP:WP4868 Type I interferon induction and signaling during SARS-CoV-2 infection 26 3.269936e-05



Unlike the previous two ORAs, this analysis returned remarkably different gene sets than its less significant counterpart (all DEGs, FDR < 0.05). The top gene sets in this cumulative ORA (listed in Table 15) had primarily up-regulated gene sets, whereas the previous cumulative ORA had mostly down-regulated gene sets (Table 9). In particular, I noticed that gene sets related to ribosomes, translation, and elongation did not make the top-ten list in this analysis, and were instead replaced with gene sets related to viral response, cell cycle, and interferon signaling. This reflects the results from above, where the up-regulated ORA with FDR < 0.001 DEGs had more significant p-values but the down-regulated ORA with FDR < 0.001 DEGs had less significant p-values. Thus, it seems that using a more restricted gene list (with a more stringent threshold) causes up-regulated pathways to be more significantly over-represented than down-regulated pathways. This suggests that the most significant DEGs are up-regulated rather than down-regulated. The volcano plot earlier in this report confirms this pattern: the up-regulated DEGs had larger logFCs than the down-regulated DEGs, and the top part of the plot was more densely populated with red dots (up-regulated genes) than blue dots (down-regulated genes).

oras <- list(sigORA$result, verySigORA$result)
compareNumGeneSets(oras, fdrs=c("FDR < 0.05", "FDR < 0.001"), 
                   logFCs=c("0", "0"), 
                   caption = paste("Number of significantly over-represented",
                                   "gene sets in DEG lists with two",
                                   "different thresholds."))
Table 16: Number of significantly over-represented gene sets in DEG lists with two different thresholds.
DEG Sig. Threshold Expression Change Num. Gene Sets
FDR < 0.05 All DEGs 1773
FDR < 0.001 All DEGs 1882

Table 16 shows that the ORA from the stricter-thresholded DEG list (FDR < 0.001) returned 109 more significant gene sets than the weaker-thresholded DEG list (FDR < 0.05).

oras <- list(verySigORA$result, veryUpRegORA$result, veryDownRegORA$result)
compareNumGeneSets(oras, fdrs=rep("FDR < 0.001", 3), logFCs=c("0", "+", "-"),
                   caption = paste("Number of significantly over-represented",
                                   "gene sets in gene lists with threshold",
                                   "FDR < 0.001."))
Table 17: Number of significantly over-represented gene sets in gene lists with threshold FDR < 0.001.
DEG Sig. Threshold Expression Change Num. Gene Sets
FDR < 0.001 All DEGs 1882
FDR < 0.001 Up-regulated 1570
FDR < 0.001 Down-regulated 542

Table 17 shows the number of gene sets returned in each ORA that used a DEG significance threshold of 0.001. As in the weaker-threshold (FDR < 0.05) case, the number of gene sets returned by the cumulative ORA (all DEGs) was greater than either of the individual ORAs (only up- or down-regulated DEGs), but less than the sum of the individual ORAs. Again, the down-regulated gene list returned much fewer gene sets than the up-regulated gene list.


Table 18 summarizes the results of the six ORAs performed above. Each cell shows the number of significantly over-represented gene sets found in each gene list, and each gene list is characterized by the FDR threshold (columns) and the expression change (rows).

numSigPathways <- unlist(lapply(
  list(sigORA$result, upRegORA$result, downRegORA$result),
  getNumSigPathways
))
numVerySigPathways <- unlist(lapply(
  list(verySigORA$result, veryUpRegORA$result, veryDownRegORA$result),
  getNumSigPathways
))

numGeneSetsDf <- data.frame("FDR < 0.05" = numSigPathways, 
                            "FDR < 0.001" = numVerySigPathways, check.names=F)
rownames(numGeneSetsDf) <- c("All DEGs", "Up-regulated", "Down-regulated")

summaryKable <- knitr::kable(
  numGeneSetsDf, type="html", 
  caption = paste("Number of significantly over-represented gene sets returned",
                  "by each ORA"))
kableExtra::kable_styling(summaryKable, "condensed")
Table 18: Number of significantly over-represented gene sets returned by each ORA
FDR < 0.05 FDR < 0.001
All DEGs 1773 1882
Up-regulated 1488 1570
Down-regulated 449 542

In all cases, the stricter-threshold ORA returned more gene sets than its weaker-threshold counterpart.


Discussion

The original study claimed that (1) UPR is generally up-regulated in MKD cells relative to normal kidney cells, and the ATF6 branch of UPR in particular is up-regulated; and (2) misfolded MUC1 proteins are trapped inside TMED9-containing vesicles in the early secretory pathway (between the ER and the Golgi), preventing misfolded Muc1 from being degraded in the lysosomes (Dvela-Levitt et al., 2019). I will investigate both of these claims, using the two helper functions below.

queryPathways <- function(queryTerm, oraList, significant=TRUE, 
                          caseSensitive=FALSE) {
  # initialize pathways data.frame
  pathways <- as.data.frame(matrix(nrow=0, ncol=7))
  colnames(pathways) <- c("sig", "expr","source", "term_id", "term_name", 
                          "term_size", "p_value")
  for (ora in oraList) {
    # If significant option is set, remove any non-sig gene sets
    if (significant) {
      geneSets <- ora$result[ora$result$significant, ]
    } else {
      geneSets <- ora$result
    }
    
    # Get all pathways with the query term from ORA results
    if (any(grepl(queryTerm, geneSets$term_name, ignore.case=T))) {
      df <- geneSets[grepl(queryTerm, geneSets$term_name, ignore.case=T), ]
      df$sig <- ora$sig
      df$expr <- paste(ora$expr, "DEGs")
      pathways <- rbind(pathways, df[, colnames(pathways)])
    }
  }
  
  # Sort pathways by pathways ID, DEG expression change, and DEG threshold
  pathways$expr <- factor(pathways$expr)
  pathways <- pathways[with(pathways, 
                            order(term_id, -expr, sig)), ]
  
  # Remove duplicate pathways, with preference to up/down regulated or 
  # stricter threshold (FDR < 0.001)
  pathways <- pathways[!duplicated(pathways$term_id), ]

  # Print table
  cap <- paste0("Pathways containing the term '", queryTerm, "'")
  if (significant) {
    cap <- paste(cap, 
                 "that are significantly over-represented in any of the ORAs.")
  }
  pathwaysKable <- knitr::kable(
    pathways, 
    col.names = c("DEG Sig. Threshold", "DEG Expr. Change", 
                  "Annotation Dataset", "Gene Set ID","Gene Set Name", 
                  "Gene Set Size", "P-Value"),
    caption=cap, row.names = FALSE, type="html", digits=42)
  kableExtra::kable_styling(pathwaysKable, "condensed")
}

UPR

First, I will search all the ORA results for significantly regulated pathways that have the terms “UPR,” “ATF6,” “IRE1,” or “PERK” in their name (ATF6, IRE1, and PERK are the three branches of UPR).

queryPathways("UPR", allOras, significant = T, caseSensitive = T)
Table 19: Pathways containing the term ‘UPR’ that are significantly over-represented in any of the ORAs.
DEG Sig. Threshold DEG Expr. Change Annotation Dataset Gene Set ID Gene Set Name Gene Set Size P-Value
FDR < 0.001 Up-regulated DEGs GO:BP GO:0097435 supramolecular fiber organization 496 0.0001013347
FDR < 0.001 Up-regulated DEGs GO:BP GO:1902903 regulation of supramolecular fiber organization 246 0.0123923493
FDR < 0.001 Up-regulated DEGs GO:BP GO:1902904 negative regulation of supramolecular fiber organization 106 0.0297628916
FDR < 0.05 All DEGs DEGs GO:BP GO:1902905 positive regulation of supramolecular fiber organization 117 0.0445919054
queryPathways("ATF6", allOras, significant = T)
Table 20: Pathways containing the term ‘ATF6’ that are significantly over-represented in any of the ORAs.
DEG Sig. Threshold DEG Expr. Change Annotation Dataset Gene Set ID Gene Set Name Gene Set Size P-Value
queryPathways("IRE1", allOras, significant = T)
Table 21: Pathways containing the term ‘IRE1’ that are significantly over-represented in any of the ORAs.
DEG Sig. Threshold DEG Expr. Change Annotation Dataset Gene Set ID Gene Set Name Gene Set Size P-Value
queryPathways("PERK", allOras, significant = T)
Table 22: Pathways containing the term ‘PERK’ that are significantly over-represented in any of the ORAs.
DEG Sig. Threshold DEG Expr. Change Annotation Dataset Gene Set ID Gene Set Name Gene Set Size P-Value


Tables 19 - 22 reveal that there are no significantly enriched pathways containing the terms “UPR,” “ATF6,” “IRE1,” or “PERK” in their name, respectively. As a sanity check, I can turn off the significance requirement when querying pathways.


queryPathways("UPR", allOras, significant = F, caseSensitive = T)
Table 23: Pathways containing the term ‘UPR’
DEG Sig. Threshold DEG Expr. Change Annotation Dataset Gene Set ID Gene Set Name Gene Set Size P-Value
FDR < 0.001 Up-regulated DEGs GO:BP GO:0097435 supramolecular fiber organization 496 0.0001013347
FDR < 0.001 Up-regulated DEGs GO:BP GO:1902903 regulation of supramolecular fiber organization 246 0.0123923493
FDR < 0.001 Up-regulated DEGs GO:BP GO:1902904 negative regulation of supramolecular fiber organization 106 0.0297628916
FDR < 0.001 Up-regulated DEGs GO:BP GO:1902905 positive regulation of supramolecular fiber organization 117 0.3463039114
FDR < 0.001 Up-regulated DEGs REAC REAC:R-HSA-381119 Unfolded Protein Response (UPR) 86 0.5312205229
FDR < 0.001 Up-regulated DEGs WP WP:WP4479 Supression of HMGB1 mediated inflammation by THBD 7 0.1073607927
queryPathways("ATF6", allOras, significant = F)
Table 24: Pathways containing the term ‘ATF6’
DEG Sig. Threshold DEG Expr. Change Annotation Dataset Gene Set ID Gene Set Name Gene Set Size P-Value
FDR < 0.001 Down-regulated DEGs GO:BP GO:0036500 ATF6-mediated unfolded protein response 7 0.5948911
FDR < 0.001 Down-regulated DEGs REAC REAC:R-HSA-381033 ATF6 (ATF6-alpha) activates chaperones 11 0.9436722
FDR < 0.001 Down-regulated DEGs REAC REAC:R-HSA-381183 ATF6 (ATF6-alpha) activates chaperone genes 9 0.9112982


Tables 23 - 24 prove that there are pathways associated with general UPR and ATF6-mediated UPR, but that none of those pathways are significantly regulated. They all have extremely large p-values (> 0.5), and therefore contradict the results from the original study.


TMED9, ER-Golgi Vesicle Transport

Next, I will look for pathways related to vesicle transport between the ER and the Golgi, specifically TMED9-containing vesicles.

queryPathways("TMED9", allOras, significant = TRUE)
Table 25: Pathways containing the term ‘TMED9’ that are significantly over-represented in any of the ORAs.
DEG Sig. Threshold DEG Expr. Change Annotation Dataset Gene Set ID Gene Set Name Gene Set Size P-Value
queryPathways("vesicle", allOras, significant = TRUE)
Table 26: Pathways containing the term ‘vesicle’ that are significantly over-represented in any of the ORAs.
DEG Sig. Threshold DEG Expr. Change Annotation Dataset Gene Set ID Gene Set Name Gene Set Size P-Value
FDR < 0.001 Down-regulated DEGs GO:BP GO:0016192 vesicle-mediated transport 966 0.01848052
FDR < 0.001 Up-regulated DEGs GO:BP GO:0060627 regulation of vesicle-mediated transport 302 0.01978143
FDR < 0.001 Up-regulated DEGs WP WP:WP4300 Extracellular vesicles in the crosstalk of cardiac cells 14 0.01477859
queryPathways("Golgi", allOras, significant = TRUE)
Table 27: Pathways containing the term ‘Golgi’ that are significantly over-represented in any of the ORAs.
DEG Sig. Threshold DEG Expr. Change Annotation Dataset Gene Set ID Gene Set Name Gene Set Size P-Value
FDR < 0.05 All DEGs DEGs REAC REAC:R-HSA-6811436 COPI-independent Golgi-to-ER retrograde traffic 42 0.02258587
FDR < 0.001 Up-regulated DEGs REAC REAC:R-HSA-975576 N-glycan antennae elongation in the medial/trans-Golgi 19 0.01299464


The search for TMED9 returned no results (Table 25); however this is expected since TMED9 is simply a protein-binding cargo receptor which likely does not have any pathways named after it.

The other two queries did return results. Table 26 shows three pathways from GO:BP and WikiPathways that are significantly enriched in at least one of the thresholded gene lists, and Table 27 shows two pathways from Reactome that are significantly enriched. The results suggest that vesicle-mediated transport is down-regulated and that there is differential regulation of COPI-independent Golgi-to-ER retrograde traffic (although it is unclear whether that pathway is up- or down-regulated). The original study shows that anterograde trafficking from the Golgi back to the ER is necessary for the lysosomal degradation and clearing of misfolded Muc1, and claims that interference in this early secretory pathways causes the progression of MKD (Dvela-Levitt et al., 2019). Thus, my ORA results align well with the study’s results: down-regulation of vesicle-mediated transport and potential down-regulation of retrograde transport would both greatly hinder the secretory pathway, thus preventing accumulated misfolded Muc1 from being transported to the lysosome for degradation.


ER Stress

Unfortunately, there are very few publications about MKD. This is partly because MKD is difficult to diagnose and differentiate from other genetic kidney disease; it is caused by a subtle mutation that is hard to identify and it has no symptoms except for tubular fibrosis in the kidneys (Blumenstiel et al., 2016; Eckardt et al., 2015). All of these factors make MKD challenging to study, so the field of evidence related to this disease is extremely sparse.

However, a review article by Sho Hasegawa and Reiko Inagi discusses the role of UPR as well as many other pathways involved in kidney disease (Hasegawa & Inagi, 2020). Specifically, Hasegawa and Inagi explain the causes and responses of ER stress in kidney cells (Hasegawa & Inagi, 2020). A toxic proteinopathy like MKD is a form of ER stress (Dvela-Levitt et al., 2019); therefore, pathways related to ER stress should be enriched in MKD cells. This turns out to be the case for this RNASeq dataset.

First, Hasegawa and Inagi state that oxidative stress and hypoxia can cause pathogenic ER stress in the kidneys (Hasegawa & Inagi, 2020). Table 28 reveals that eight hypoxia-related gene sets are significantly enriched in one of the thresholded gene lists. Based on Table 28 alone, it is unclear whether cellular response to hypoxia is up-regulated or down-regulated in MKD cells: the GO results suggest down-regulation while the Reactome results suggest up-regulation.

Second, Hasegawa and Inagi claim that ER stress in tubular cells can cause tubular apoptosis (Hasegawa & Inagi, 2020). Apoptosis was one of the most-significantly up-regulated pathways in our ORA tables (Tables 7, 11). Similarly, Table 29 shows that apoptosis-related pathways in three different annotation datasets (KEGG, Reactome, WikiPathways) are highly up-regulated in MKD cells.

Finally, Hasegawa and Inagi discuss how during the adaptive phase of ER stress, crosstalk between mitochondria and the ER result in increased calcium uptake (Hasegawa & Inagi, 2020). Table 30 shows that many calcium ion transport pathways from GO:BP and KEGG are significantly up-regulated in MKD cells.

Combining all of these observations, it is clear that many of the pathways up-regulated in MKD cells are related to ER stress. Although these relationships are indirect (e.g., the pathways reveal symptoms of ER stress), they are abundant and significant. One would expect MKD cells to be strongly enriched in UPR, but also enriched in pathways related to ER stress. My ORA results do not support the former of those hypotheses, but they do support the latter. Therefore, the results of my over-representation analysis are only partly supported by existing publications.

queryPathways("hypoxia", allOras, significant = TRUE)
Table 28: Pathways containing the term ‘hypoxia’ that are significantly over-represented in any of the ORAs.
DEG Sig. Threshold DEG Expr. Change Annotation Dataset Gene Set ID Gene Set Name Gene Set Size P-Value
FDR < 0.001 Down-regulated DEGs GO:BP GO:0001666 response to hypoxia 135 4.774269e-11
FDR < 0.001 All DEGs GO:BP GO:0061419 positive regulation of transcription from RNA polymerase II promoter in response to hypoxia 5 3.172140e-02
FDR < 0.001 Down-regulated DEGs GO:BP GO:0071456 cellular response to hypoxia 85 1.885910e-07
FDR < 0.001 All DEGs GO:BP GO:1900038 negative regulation of cellular response to hypoxia 7 2.216487e-02
FDR < 0.001 All DEGs GO:BP GO:1990144 intrinsic apoptotic signaling pathway in response to hypoxia 5 3.172140e-02
FDR < 0.001 Down-regulated DEGs REAC REAC:R-HSA-1234158 Regulation of gene expression by Hypoxia-inducible Factor 9 6.885987e-03
FDR < 0.001 Up-regulated DEGs REAC REAC:R-HSA-1234174 Cellular response to hypoxia 70 1.168336e-04
FDR < 0.001 Up-regulated DEGs REAC REAC:R-HSA-1234176 Oxygen-dependent proline hydroxylation of Hypoxia-inducible Factor Alpha 62 6.982746e-05
queryPathways("apoptosis", allOras, significant = TRUE)
Table 29: Pathways containing the term ‘apoptosis’ that are significantly over-represented in any of the ORAs.
DEG Sig. Threshold DEG Expr. Change Annotation Dataset Gene Set ID Gene Set Name Gene Set Size P-Value
FDR < 0.001 Up-regulated DEGs KEGG KEGG:04210 Apoptosis 115 5.591527e-03
FDR < 0.001 Up-regulated DEGs REAC REAC:R-HSA-109581 Apoptosis 161 8.001187e-08
FDR < 0.05 Up-regulated DEGs REAC REAC:R-HSA-109606 Intrinsic Pathway for Apoptosis 53 3.315779e-02
FDR < 0.001 Up-regulated DEGs REAC REAC:R-HSA-169911 Regulation of Apoptosis 50 3.062822e-06
FDR < 0.001 Up-regulated DEGs WP WP:WP1772 Apoptosis modulation and signaling 82 2.129429e-02
FDR < 0.001 Up-regulated DEGs WP WP:WP254 Apoptosis 76 1.311449e-02
queryPathways("calcium", allOras, significant = TRUE)
Table 30: Pathways containing the term ‘calcium’ that are significantly over-represented in any of the ORAs.
DEG Sig. Threshold DEG Expr. Change Annotation Dataset Gene Set ID Gene Set Name Gene Set Size P-Value
FDR < 0.001 Up-regulated DEGs GO:BP GO:0006816 calcium ion transport 166 0.0297733275
FDR < 0.001 Up-regulated DEGs GO:BP GO:0006874 cellular calcium ion homeostasis 172 0.0005372135
FDR < 0.001 Up-regulated DEGs GO:BP GO:0007204 positive regulation of cytosolic calcium ion concentration 106 0.0033734356
FDR < 0.001 Up-regulated DEGs GO:BP GO:0010522 regulation of calcium ion transport into cytosol 46 0.0015475911
FDR < 0.05 Up-regulated DEGs GO:BP GO:0010880 regulation of release of sequestered calcium ion into cytosol by sarcoplasmic reticulum 15 0.0253346820
FDR < 0.001 All DEGs GO:BP GO:0014808 release of sequestered calcium ion into cytosol by sarcoplasmic reticulum 17 0.0374391277
FDR < 0.001 All DEGs GO:BP GO:0016339 calcium-dependent cell-cell adhesion via plasma membrane cell adhesion molecules 12 0.0168105886
FDR < 0.001 Down-regulated DEGs GO:BP GO:0017156 calcium-ion regulated exocytosis 30 0.0307734435
FDR < 0.001 Up-regulated DEGs GO:BP GO:0019722 calcium-mediated signaling 92 0.0300799502
FDR < 0.05 Up-regulated DEGs GO:BP GO:0035584 calcium-mediated signaling using intracellular calcium source 13 0.0395763912
FDR < 0.001 All DEGs GO:BP GO:0050849 negative regulation of calcium-mediated signaling 10 0.0268160792
FDR < 0.001 Up-regulated DEGs GO:BP GO:0051208 sequestering of calcium ion 58 0.0312016922
FDR < 0.001 Up-regulated DEGs GO:BP GO:0051209 release of sequestered calcium ion into cytosol 53 0.0158892093
FDR < 0.001 Up-regulated DEGs GO:BP GO:0051279 regulation of release of sequestered calcium ion into cytosol 39 0.0036915856
FDR < 0.001 Up-regulated DEGs GO:BP GO:0051281 positive regulation of release of sequestered calcium ion into cytosol 23 0.0413259942
FDR < 0.001 Up-regulated DEGs GO:BP GO:0051282 regulation of sequestering of calcium ion 55 0.0212373947
FDR < 0.001 Up-regulated DEGs GO:BP GO:0051283 negative regulation of sequestering of calcium ion 54 0.0186149059
FDR < 0.001 Up-regulated DEGs GO:BP GO:0051480 regulation of cytosolic calcium ion concentration 116 0.0001524332
FDR < 0.001 All DEGs GO:BP GO:0051481 negative regulation of cytosolic calcium ion concentration 3 0.0390109706
FDR < 0.001 All DEGs GO:BP GO:0051592 response to calcium ion 68 0.0046771321
FDR < 0.001 Up-regulated DEGs GO:BP GO:0051924 regulation of calcium ion transport 103 0.0053061602
FDR < 0.001 Up-regulated DEGs GO:BP GO:0051928 positive regulation of calcium ion transport 51 0.0119075304
FDR < 0.001 Up-regulated DEGs GO:BP GO:0055074 calcium ion homeostasis 175 0.0003510758
FDR < 0.001 Up-regulated DEGs GO:BP GO:0060314 regulation of ryanodine-sensitive calcium-release channel activity 18 0.0024371653
FDR < 0.001 Up-regulated DEGs GO:BP GO:0060401 cytosolic calcium ion transport 86 0.0332778773
FDR < 0.001 Up-regulated DEGs GO:BP GO:0060402 calcium ion transport into cytosol 70 0.0050229025
FDR < 0.05 All DEGs DEGs GO:BP GO:0070296 sarcoplasmic reticulum calcium ion transport 19 0.0166465155
FDR < 0.001 Up-regulated DEGs GO:BP GO:0070509 calcium ion import 35 0.0440922141
FDR < 0.001 All DEGs GO:BP GO:0070588 calcium ion transmembrane transport 125 0.0077475354
FDR < 0.001 All DEGs GO:BP GO:0071277 cellular response to calcium ion 43 0.0168355779
FDR < 0.001 Up-regulated DEGs GO:BP GO:0090279 regulation of calcium ion import 16 0.0267657284
FDR < 0.001 Up-regulated DEGs GO:BP GO:0097553 calcium ion transmembrane import into cytosol 64 0.0051415712
FDR < 0.001 All DEGs GO:BP GO:0098703 calcium ion import across plasma membrane 10 0.0046871685
FDR < 0.001 All DEGs GO:BP GO:1902656 calcium ion import into cytosol 10 0.0046871685
FDR < 0.001 Up-regulated DEGs GO:BP GO:1903169 regulation of calcium ion transmembrane transport 71 0.0297733275
FDR < 0.001 All DEGs GO:BP GO:1903514 release of sequestered calcium ion into cytosol by endoplasmic reticulum 17 0.0374391277
FDR < 0.001 All DEGs GO:BP GO:1990927 calcium ion regulated lysosome exocytosis 3 0.0390109706
FDR < 0.001 Up-regulated DEGs KEGG KEGG:04020 Calcium signaling pathway 119 0.0084232981
FDR < 0.05 All DEGs DEGs KEGG KEGG:04961 Endocrine and other factor-regulated calcium reabsorption 36 0.0156409145


Conclusion

The common patterns I saw in the over-representation analysis of GSE129943 was an up-regulation of pathways related to viral response and hypoxia, and a down-regulation of pathways related to protein translation and vesicle-mediated transport. These results partly aligned with the original study (Dvela-Levitt et al., 2019) and were partly supported by other publications. A fuller gene set enrichment analysis is required to investigate the discrepancies between my ORA results and existing literature on MKD.

References

Adamson, B., Norman, T. M., Jost, M., Cho, M. Y., Nuñez, J. K., Chen, Y., Villalta, J. E., Gilbert, L. A., Horlbeck, M. A., Hein, M. Y.others. (2016). A multiplexed single-cell CRISPR screening platform enables systematic dissection of the unfolded protein response. Cell, 167(7), 1867–1882.
Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T.others. (2000). Gene ontology: Tool for the unification of biology. Nature Genetics, 25(1), 25–29.
Blumenstiel, B., DeFelice, M., Birsoy, O., Bleyer, A. J., Kmoch, S., Carter, T. A., Gnirke, A., Kidd, K., Rehm, H. L., Ronco, L., Lander, E. S., Gabriel, S., & Lennon, N. J. (2016). Development and validation of a mass spectrometry–based assay for the molecular diagnosis of mucin-1 kidney disease. The Journal of Molecular Diagnostics, 18(4), 566–571. https://doi.org/https://doi.org/10.1016/j.jmoldx.2016.03.003
Boettiger, C., & Eddelbuettel, D. (2017). An Introduction to Rocker: Docker Containers for R. The R Journal, 9(2), 527–536. https://doi.org/10.32614/RJ-2017-065
Chen, Y., Lun, A. T., & Smyth, G. K. (2016). From reads to genes to pathways: Differential expression analysis of RNA-seq experiments using rsubread and the edgeR quasi-likelihood pipeline. F1000Research, 5.
Dvela-Levitt, M., Kost-Alimova, M., Emani, M., Kohnert, E., Thompson, R., Sidhom, E.-H., Rivadeneira, A., Sahakian, N., Roignot, J., Papagregoriou, G.others. (2019). Small molecule targets TMED9 and promotes lysosomal degradation to reverse proteinopathy. Cell, 178(3), 521–535.
Eckardt, K.-U., Alper, S. L., Antignac, C., Bleyer, A. J., Chauveau, D., Dahan, K., Deltas, C., Hosking, A., Kmoch, S., Rampoldi, L., Wiesener, M., Wolf, M. T., & Devuyst, O. (2015). Autosomal dominant tubulointerstitial kidney disease: Diagnosis, classification, and management—a KDIGO consensus report. Kidney International, 88(4), 676–683. https://doi.org/https://doi.org/10.1038/ki.2015.28
Gillespie, M., Jassal, B., Stephan, R., Milacic, M., Rothfels, K., Senff-Ribeiro, A., Griss, J., Sevilla, C., Matthews, L., Gong, C.others. (2022). The reactome pathway knowledgebase 2022. Nucleic Acids Research, 50(D1), D687–D692.
Gu, Z., Eils, R., & Schlesner, M. (2016). Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics.
Gu, Z., Gu, L., Eils, R., Schlesner, M., & Brors, B. (2014). Circlize implements and enhances circular visualization in r. Bioinformatics, 30, 2811–2812.
Hasegawa, S., & Inagi, R. (2020). Organelle stress and crosstalk in kidney disease. Kidney360, 1(10), 1157–1164.
Kanehisa, M. (2019). Toward understanding the origin and evolution of cellular organisms. Protein Science, 28(11), 1947–1951.
Kanehisa, M., Furumichi, M., Sato, Y., Ishiguro-Watanabe, M., & Tanabe, M. (2021). KEGG: Integrating viruses and cellular organisms. Nucleic Acids Research, 49(D1), D545–D551.
Kanehisa, M., & Goto, S. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28(1), 27–30.
Kim, H.-Y. (2017). Statistical notes for clinical researchers: Chi-squared test and fisher’s exact test. Restorative Dentistry & Endodontics, 42(2), 152–155.
Kolberg, L., Raudvere, U., Kuzmin, I., Vilo, J., & Peterson, H. (2020). gprofiler2– an r package for gene list functional enrichment analysis and namespace conversion toolset g:profiler. F1000Research, 9 (ELIXIR)(709).
Martens, M., Ammar, A., Riutta, A., Waagmeester, A., Slenter, D. N., Hanspers, K., A. Miller, R., Digles, D., Lopes, E. N., Ehrhart, F.others. (2021). WikiPathways: Connecting communities. Nucleic Acids Research, 49(D1), D613–D621.
McCarthy, D. J., Chen, Y., & Smyth, G. K. (2012). Differential expression analysis of multifactor RNA-seq experiments with respect to biological variation. Nucleic Acids Research, 40(10), 4288–4297.
Merkel, D. (2014). Docker: Lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2.
R Core Team. (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/
Raudvere, U., Kolberg, L., Kuzmin, I., Arak, T., Adler, P., Peterson, H., & Vilo, J. (2019). G: Profiler: A web server for functional enrichment analysis and conversions of gene lists (2019 update). Nucleic Acids Research, 47(W1), W191–W198.
Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2010). edgeR: A bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139–140.
Sievert, C. (2020). Interactive web-based data visualization with r, plotly, and shiny. Chapman; Hall/CRC. https://plotly-r.com
The gene ontology resource: Enriching a GOld mine. (2021). Nucleic Acids Research, 49(D1), D325–D334.
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org
Xie, Y. (2013). Knitr: A general-purpose tool for dynamic report generation in r. R Package Version, 1(1).
Xie, Y. (2017). Dynamic documents with r and knitr. Chapman; Hall/CRC.
Xie, Y. (2018). Knitr: A comprehensive tool for reproducible research in r. In Implementing reproducible research (pp. 3–31). Chapman; Hall/CRC.
Zhu, H. (2021). kableExtra: Construct complex table with ’kable’ and pipe syntax.